Executive Summary

DATA 2002 (Data Analytics: Learning from Data) is an intermediate unit of study at the University of Sydney. The unit aims to equip students with the knowledge and skills to tackle data analytic challenges stemming from everyday problems.

This report seeks to identify a good classifier for spam vs non-spam messages and to report on its in-sample and out-of-sample performance.

Three methods were applied: a decision tree (extended to a random forest), logistic regression, and a k-nearest neighbours approach. Of these, logistic regression was found to be the best classifier.

Introduction


As part of the Semester 2, 2018 assessment for the unit, students are required to identify a good classifier for spam vs non-spam emails and report on its in-sample and out-of-sample performance.

Data Overview

The data set contains 4601 messages described by 58 variables (57 predictors plus the class label). The objective is to predict whether an email is junk email, or 'spam'.

Data Import

The data can be found at https://archive.ics.uci.edu/ml/datasets/spambase (which gives quite a lot of background information about the data set). It is also available in the kernlab package, which is perhaps the simplest way to load it into R:

library(lattice)
library(ggplot2)
library(caret)        # train(), confusionMatrix(), cross-validation helpers
library(rpart)        # decision trees
library(rpart.plot)
library(partykit)
library(randomForest)
library(class)        # knn()
library(cvTools)
library(stargazer)    # regression tables
library(dplyr)        # glimpse(), select(), %>% (used below)
data(spam, package = "kernlab")
s = spam   # working copy for the tree models
t = spam   # working copy for the k-nearest neighbours section
glimpse(spam)
## Observations: 4,601
## Variables: 58
## $ make              <dbl> 0.00, 0.21, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ address           <dbl> 0.64, 0.28, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ all               <dbl> 0.64, 0.50, 0.71, 0.00, 0.00, 0.00, 0.00, 0....
## $ num3d             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ our               <dbl> 0.32, 0.14, 1.23, 0.63, 0.63, 1.85, 1.92, 1....
## $ over              <dbl> 0.00, 0.28, 0.19, 0.00, 0.00, 0.00, 0.00, 0....
## $ remove            <dbl> 0.00, 0.21, 0.19, 0.31, 0.31, 0.00, 0.00, 0....
## $ internet          <dbl> 0.00, 0.07, 0.12, 0.63, 0.63, 1.85, 0.00, 1....
## $ order             <dbl> 0.00, 0.00, 0.64, 0.31, 0.31, 0.00, 0.00, 0....
## $ mail              <dbl> 0.00, 0.94, 0.25, 0.63, 0.63, 0.00, 0.64, 0....
## $ receive           <dbl> 0.00, 0.21, 0.38, 0.31, 0.31, 0.00, 0.96, 0....
## $ will              <dbl> 0.64, 0.79, 0.45, 0.31, 0.31, 0.00, 1.28, 0....
## $ people            <dbl> 0.00, 0.65, 0.12, 0.31, 0.31, 0.00, 0.00, 0....
## $ report            <dbl> 0.00, 0.21, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ addresses         <dbl> 0.00, 0.14, 1.75, 0.00, 0.00, 0.00, 0.00, 0....
## $ free              <dbl> 0.32, 0.14, 0.06, 0.31, 0.31, 0.00, 0.96, 0....
## $ business          <dbl> 0.00, 0.07, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ email             <dbl> 1.29, 0.28, 1.03, 0.00, 0.00, 0.00, 0.32, 0....
## $ you               <dbl> 1.93, 3.47, 1.36, 3.18, 3.18, 0.00, 3.85, 0....
## $ credit            <dbl> 0.00, 0.00, 0.32, 0.00, 0.00, 0.00, 0.00, 0....
## $ your              <dbl> 0.96, 1.59, 0.51, 0.31, 0.31, 0.00, 0.64, 0....
## $ font              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num000            <dbl> 0.00, 0.43, 1.16, 0.00, 0.00, 0.00, 0.00, 0....
## $ money             <dbl> 0.00, 0.43, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ hp                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hpl               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ george            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num650            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ lab               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ labs              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ telnet            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num857            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ data              <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ num415            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num85             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ technology        <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ num1999           <dbl> 0.00, 0.07, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ parts             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ pm                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ direct            <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ cs                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ meeting           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ original          <dbl> 0.00, 0.00, 0.12, 0.00, 0.00, 0.00, 0.00, 0....
## $ project           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ re                <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ edu               <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ table             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ conference        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ charSemicolon     <dbl> 0.000, 0.000, 0.010, 0.000, 0.000, 0.000, 0....
## $ charRoundbracket  <dbl> 0.000, 0.132, 0.143, 0.137, 0.135, 0.223, 0....
## $ charSquarebracket <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0....
## $ charExclamation   <dbl> 0.778, 0.372, 0.276, 0.137, 0.135, 0.000, 0....
## $ charDollar        <dbl> 0.000, 0.180, 0.184, 0.000, 0.000, 0.000, 0....
## $ charHash          <dbl> 0.000, 0.048, 0.010, 0.000, 0.000, 0.000, 0....
## $ capitalAve        <dbl> 3.756, 5.114, 9.821, 3.537, 3.537, 3.000, 1....
## $ capitalLong       <dbl> 61, 101, 485, 40, 40, 15, 4, 11, 445, 43, 6,...
## $ capitalTotal      <dbl> 278, 1028, 2259, 191, 191, 54, 112, 49, 1257...
## $ type              <fct> spam, spam, spam, spam, spam, spam, spam, sp...

Results

Logistic Regression

Fit a logistic regression model using all predictors:

glm1 = glm(type ~., family = binomial, data = spam)

Perform backward stepwise model selection using the AIC:

step.back.aic = step(glm1, direction = "backward", trace = FALSE)
stargazer::stargazer(glm1, step.back.aic, type = "html", column.labels = c("Full model","Stepwise model"))
Dependent variable:
type
Full model Stepwise model
(1) (2)
make -0.390* -0.469**
(0.231) (0.216)
address -0.146** -0.137**
(0.069) (0.065)
all 0.114
(0.110)
num3d 2.252 2.257
(1.507) (1.507)
our 0.562*** 0.566***
(0.102) (0.102)
over 0.883*** 0.825***
(0.250) (0.245)
remove 2.279*** 2.261***
(0.333) (0.327)
internet 0.570*** 0.565***
(0.168) (0.166)
order 0.734*** 0.668**
(0.285) (0.275)
mail 0.127* 0.116*
(0.073) (0.070)
receive -0.256
(0.298)
will -0.138* -0.136*
(0.074) (0.073)
people -0.080
(0.230)
report 0.145
(0.136)
addresses 1.236* 1.293*
(0.725) (0.703)
free 1.039*** 1.048***
(0.146) (0.145)
business 0.960*** 0.945***
(0.225) (0.221)
email 0.120
(0.117)
you 0.081** 0.090***
(0.035) (0.034)
credit 1.047* 1.117**
(0.538) (0.553)
your 0.242*** 0.233***
(0.052) (0.049)
font 0.201 0.221
(0.163) (0.165)
num000 2.245*** 2.193***
(0.471) (0.467)
money 0.426*** 0.442***
(0.162) (0.169)
hp -1.920*** -1.981***
(0.313) (0.313)
hpl -1.040** -1.036**
(0.440) (0.440)
george -11.767*** -11.220***
(2.113) (1.795)
num650 0.445** 0.418**
(0.199) (0.199)
lab -2.486* -2.525*
(1.502) (1.525)
labs -0.330
(0.314)
telnet -0.170
(0.482)
num857 2.549
(3.283)
data -0.738** -0.730**
(0.312) (0.308)
num415 0.668
(1.601)
num85 -2.055*** -2.137***
(0.788) (0.783)
technology 0.924*** 0.964***
(0.309) (0.309)
num1999 0.047
(0.175)
parts -0.597 -0.606
(0.423) (0.427)
pm -0.865** -0.867**
(0.383) (0.383)
direct -0.305
(0.364)
cs -45.048* -44.200*
(26.598) (26.427)
meeting -2.689*** -2.690***
(0.838) (0.845)
original -1.247 -1.274
(0.806) (0.823)
project -1.573*** -1.619***
(0.529) (0.535)
re -0.792*** -0.796***
(0.156) (0.155)
edu -1.459*** -1.466***
(0.269) (0.268)
table -2.326 -2.356
(1.659) (1.793)
conference -4.016** -4.033***
(1.611) (1.564)
charSemicolon -1.291*** -1.309***
(0.442) (0.447)
charRoundbracket -0.188
(0.249)
charSquarebracket -0.657
(0.838)
charExclamation 0.347*** 0.359***
(0.089) (0.091)
charDollar 5.336*** 5.481***
(0.706) (0.706)
charHash 2.403** 2.202**
(1.113) (1.073)
capitalAve 0.012
(0.019)
capitalLong 0.009*** 0.010***
(0.003) (0.002)
capitalTotal 0.001*** 0.001***
(0.0002) (0.0002)
Constant -1.569*** -1.552***
(0.142) (0.128)
Observations 4,601 4,601
Log Likelihood -907.883 -912.438
Akaike Inf. Crit. 1,931.765 1,912.876
Note: *p<0.1; **p<0.05; ***p<0.01
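As a quick sanity check, the AIC values at the bottom of the table can be reproduced from the reported log-likelihoods using AIC = -2 logLik + 2k, where k counts estimated coefficients including the intercept (58 for the full model; 44 for the stepwise model, which retains 43 predictors). A base-R sketch:

```r
# AIC = -2 * logLik + 2 * k, k = number of coefficients incl. intercept
aic_full = -2 * (-907.883) + 2 * 58  # full model: 57 predictors + intercept
aic_step = -2 * (-912.438) + 2 * 44  # stepwise model: 43 predictors + intercept
round(c(full = aic_full, stepwise = aic_step), 3)  # matches the table up to rounding
```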

Generate a confusion matrix to assess the in-sample accuracy of the predictions from the stepwise model:

# Threshold the fitted probabilities at 0.5 and label the two classes
preds = factor(ifelse(predict(step.back.aic, type = "response") >= 0.5,
                      "spam", "nonspam"))
truth = as.factor(spam$type)
confusionMatrix(data = preds, reference = truth)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam    2669  193
##    spam        119 1620
##                                           
##                Accuracy : 0.9322          
##                  95% CI : (0.9245, 0.9393)
##     No Information Rate : 0.606           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.857           
##  Mcnemar's Test P-Value : 3.584e-05       
##                                           
##             Sensitivity : 0.9573          
##             Specificity : 0.8935          
##          Pos Pred Value : 0.9326          
##          Neg Pred Value : 0.9316          
##              Prevalence : 0.6060          
##          Detection Rate : 0.5801          
##    Detection Prevalence : 0.6220          
##       Balanced Accuracy : 0.9254          
##                                           
##        'Positive' Class : nonspam         
## 

Therefore the stepwise model predicts the correct category for 93.2% of messages. It misclassifies 4.3% of nonspam emails (119 of 2788) as spam.
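The accuracy and misclassification figures can be re-derived directly from the printed confusion matrix; a quick base-R sketch (counts copied from the output above):

```r
# Counts from the confusion matrix: rows = predicted, cols = truth
cm = matrix(c(2669, 119, 193, 1620), nrow = 2,
            dimnames = list(pred  = c("nonspam", "spam"),
                            truth = c("nonspam", "spam")))
accuracy = sum(diag(cm)) / sum(cm)                              # proportion correct
nonspam_as_spam = cm["spam", "nonspam"] / sum(cm[, "nonspam"])  # 119 / 2788
round(c(accuracy = accuracy, nonspam_as_spam = nonspam_as_spam), 4)
```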

Perform 5-fold cross-validation to estimate the out-of-sample accuracy of the stepwise model.

set.seed(1)
# Recompute the in-sample class labels from the fitted probabilities
spam$pred[step.back.aic$fitted.values >= 0.5] = "spam"
spam$pred[step.back.aic$fitted.values < 0.5] = "nonspam"

a = table(spam$pred, spam$type)

table(spam$type)[1] / dim(spam)[1]  # proportion of nonspam messages
##   nonspam 
## 0.6059552
mean(spam$type == spam$pred)        # in-sample accuracy
## [1] 0.9321887
a[2, 1] / (sum(a[, 1]))             # nonspam misclassified as spam
## [1] 0.04268293
b = train(step.back.aic$formula,
      data = spam, 
      method = "glm",
      family = "binomial",
      trControl = trainControl(
        method = "cv", number = 5,
        verboseIter = FALSE
      ))
b
## Generalized Linear Model 
## 
## 4601 samples
##   43 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3680, 3680, 3682, 3680, 3682 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9297998  0.8520847

The cross-validation gives an out-of-sample accuracy of 93.0%, only very slightly lower than the in-sample accuracy of 93.2%.

Decision Tree

Create a tree classifier using the default complexity parameter of 1% (cp = 0.01):

tree = rpart(factor(type) ~ ., data = s, method = "class")

Visualise the tree with two different layouts:

rpart.plot(tree)

plot(as.party(tree))

summary(tree)
## Call:
## rpart(formula = factor(type) ~ ., data = s, method = "class")
##   n= 4601 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.47655819      0 1.0000000 1.0000000 0.01828190
## 2 0.14892443      1 0.5234418 0.5499173 0.01541402
## 3 0.04302261      2 0.3745174 0.4473249 0.01425628
## 4 0.03088803      4 0.2884721 0.3331495 0.01263460
## 5 0.01047987      5 0.2575841 0.2923331 0.01194441
## 6 0.01000000      6 0.2471042 0.2719250 0.01157217
## 
## Variable importance
##       charDollar           remove           num000            money 
##               29               13               10                9 
##  charExclamation      capitalLong           credit            order 
##                7                7                5                4 
##               hp     capitalTotal              hpl       capitalAve 
##                4                3                2                1 
##             your             free charRoundbracket              our 
##                1                1                1                1 
##           telnet 
##                1 
## 
## Node number 1: 4601 observations,    complexity param=0.4765582
##   predicted class=nonspam  expected loss=0.3940448  P(node) =1
##     class counts:  2788  1813
##    probabilities: 0.606 0.394 
##   left son=2 (3471 obs) right son=3 (1130 obs)
##   Primary splits:
##       charDollar      < 0.0555 to the left,  improve=714.1697, (0 missing)
##       charExclamation < 0.0795 to the left,  improve=711.9638, (0 missing)
##       remove          < 0.01   to the left,  improve=597.8504, (0 missing)
##       free            < 0.095  to the left,  improve=559.6634, (0 missing)
##       your            < 0.605  to the left,  improve=543.2496, (0 missing)
##   Surrogate splits:
##       num000      < 0.055  to the left,  agree=0.839, adj=0.346, (0 split)
##       money       < 0.045  to the left,  agree=0.833, adj=0.321, (0 split)
##       credit      < 0.025  to the left,  agree=0.796, adj=0.169, (0 split)
##       capitalLong < 71.5   to the left,  agree=0.793, adj=0.158, (0 split)
##       order       < 0.18   to the left,  agree=0.792, adj=0.155, (0 split)
## 
## Node number 2: 3471 observations,    complexity param=0.1489244
##   predicted class=nonspam  expected loss=0.2350908  P(node) =0.7544012
##     class counts:  2655   816
##    probabilities: 0.765 0.235 
##   left son=4 (3141 obs) right son=5 (330 obs)
##   Primary splits:
##       remove          < 0.055  to the left,  improve=331.3223, (0 missing)
##       charExclamation < 0.0915 to the left,  improve=284.6134, (0 missing)
##       free            < 0.135  to the left,  improve=266.0164, (0 missing)
##       your            < 0.615  to the left,  improve=165.9929, (0 missing)
##       capitalAve      < 3.6835 to the left,  improve=158.6464, (0 missing)
##   Surrogate splits:
##       capitalLong < 131.5  to the left,  agree=0.909, adj=0.045, (0 split)
##       charHash    < 0.8325 to the left,  agree=0.906, adj=0.012, (0 split)
##       num3d       < 7.125  to the left,  agree=0.906, adj=0.009, (0 split)
##       business    < 4.325  to the left,  agree=0.906, adj=0.009, (0 split)
##       credit      < 1.635  to the left,  agree=0.906, adj=0.006, (0 split)
## 
## Node number 3: 1130 observations,    complexity param=0.03088803
##   predicted class=spam     expected loss=0.1176991  P(node) =0.2455988
##     class counts:   133   997
##    probabilities: 0.118 0.882 
##   left son=6 (70 obs) right son=7 (1060 obs)
##   Primary splits:
##       hp              < 0.4    to the right, improve=91.33732, (0 missing)
##       hpl             < 0.12   to the right, improve=44.47552, (0 missing)
##       charExclamation < 0.0495 to the left,  improve=40.43106, (0 missing)
##       num1999         < 0.085  to the right, improve=35.90036, (0 missing)
##       george          < 0.21   to the right, improve=34.65602, (0 missing)
##   Surrogate splits:
##       hpl    < 0.31   to the right, agree=0.965, adj=0.429, (0 split)
##       telnet < 0.045  to the right, agree=0.950, adj=0.186, (0 split)
##       num650 < 0.025  to the right, agree=0.946, adj=0.129, (0 split)
##       george < 0.225  to the right, agree=0.945, adj=0.114, (0 split)
##       lab    < 0.08   to the right, agree=0.945, adj=0.114, (0 split)
## 
## Node number 4: 3141 observations,    complexity param=0.04302261
##   predicted class=nonspam  expected loss=0.1642789  P(node) =0.6826777
##     class counts:  2625   516
##    probabilities: 0.836 0.164 
##   left son=8 (2737 obs) right son=9 (404 obs)
##   Primary splits:
##       charExclamation < 0.378  to the left,  improve=173.25510, (0 missing)
##       free            < 0.2    to the left,  improve=152.11900, (0 missing)
##       capitalAve      < 3.638  to the left,  improve= 79.00492, (0 missing)
##       your            < 0.865  to the left,  improve= 69.83959, (0 missing)
##       hp              < 0.025  to the right, improve= 64.00030, (0 missing)
##   Surrogate splits:
##       num000   < 0.62   to the left,  agree=0.875, adj=0.030, (0 split)
##       free     < 2.415  to the left,  agree=0.875, adj=0.027, (0 split)
##       money    < 3.305  to the left,  agree=0.872, adj=0.007, (0 split)
##       business < 1.305  to the left,  agree=0.872, adj=0.005, (0 split)
##       order    < 2.335  to the left,  agree=0.872, adj=0.002, (0 split)
## 
## Node number 5: 330 observations
##   predicted class=spam     expected loss=0.09090909  P(node) =0.07172354
##     class counts:    30   300
##    probabilities: 0.091 0.909 
## 
## Node number 6: 70 observations
##   predicted class=nonspam  expected loss=0.1  P(node) =0.01521408
##     class counts:    63     7
##    probabilities: 0.900 0.100 
## 
## Node number 7: 1060 observations
##   predicted class=spam     expected loss=0.06603774  P(node) =0.2303847
##     class counts:    70   990
##    probabilities: 0.066 0.934 
## 
## Node number 8: 2737 observations
##   predicted class=nonspam  expected loss=0.100475  P(node) =0.5948707
##     class counts:  2462   275
##    probabilities: 0.900 0.100 
## 
## Node number 9: 404 observations,    complexity param=0.04302261
##   predicted class=spam     expected loss=0.4034653  P(node) =0.087807
##     class counts:   163   241
##    probabilities: 0.403 0.597 
##   left son=18 (182 obs) right son=19 (222 obs)
##   Primary splits:
##       capitalTotal < 55.5   to the left,  improve=63.99539, (0 missing)
##       capitalLong  < 10.5   to the left,  improve=54.95790, (0 missing)
##       capitalAve   < 2.654  to the left,  improve=53.67847, (0 missing)
##       free         < 0.04   to the left,  improve=40.70414, (0 missing)
##       our          < 0.065  to the left,  improve=25.38181, (0 missing)
##   Surrogate splits:
##       capitalLong      < 12.5   to the left,  agree=0.856, adj=0.681, (0 split)
##       capitalAve       < 2.805  to the left,  agree=0.757, adj=0.462, (0 split)
##       your             < 0.115  to the left,  agree=0.738, adj=0.418, (0 split)
##       charRoundbracket < 0.008  to the left,  agree=0.693, adj=0.319, (0 split)
##       our              < 0.065  to the left,  agree=0.673, adj=0.275, (0 split)
## 
## Node number 18: 182 observations,    complexity param=0.01047987
##   predicted class=nonspam  expected loss=0.2857143  P(node) =0.03955662
##     class counts:   130    52
##    probabilities: 0.714 0.286 
##   left son=36 (161 obs) right son=37 (21 obs)
##   Primary splits:
##       free            < 0.845  to the left,  improve=21.101450, (0 missing)
##       capitalAve      < 2.654  to the left,  improve=13.432050, (0 missing)
##       charExclamation < 0.8045 to the left,  improve=10.648500, (0 missing)
##       capitalLong     < 8.5    to the left,  improve= 6.991597, (0 missing)
##       re              < 0.23   to the right, improve= 6.714619, (0 missing)
##   Surrogate splits:
##       capitalAve    < 3.871  to the left,  agree=0.912, adj=0.238, (0 split)
##       charSemicolon < 0.294  to the left,  agree=0.907, adj=0.190, (0 split)
##       email         < 3.84   to the left,  agree=0.896, adj=0.095, (0 split)
##       our           < 2.345  to the left,  agree=0.890, adj=0.048, (0 split)
##       capitalLong   < 25     to the left,  agree=0.890, adj=0.048, (0 split)
## 
## Node number 19: 222 observations
##   predicted class=spam     expected loss=0.1486486  P(node) =0.04825038
##     class counts:    33   189
##    probabilities: 0.149 0.851 
## 
## Node number 36: 161 observations
##   predicted class=nonspam  expected loss=0.1987578  P(node) =0.03499239
##     class counts:   129    32
##    probabilities: 0.801 0.199 
## 
## Node number 37: 21 observations
##   predicted class=spam     expected loss=0.04761905  P(node) =0.004564225
##     class counts:     1    20
##    probabilities: 0.048 0.952

In-sample Performance:

type_pred = predict(tree, type = "class")
confusionMatrix(
  data = type_pred,
  reference = s$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam    2654  314
##    spam        134 1499
##                                          
##                Accuracy : 0.9026         
##                  95% CI : (0.8937, 0.911)
##     No Information Rate : 0.606          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7925         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9519         
##             Specificity : 0.8268         
##          Pos Pred Value : 0.8942         
##          Neg Pred Value : 0.9179         
##              Prevalence : 0.6060         
##          Detection Rate : 0.5768         
##    Detection Prevalence : 0.6451         
##       Balanced Accuracy : 0.8894         
##                                          
##        'Positive' Class : nonspam        
## 

The in-sample accuracy of our tree is 90.3%. It misclassifies 4.8% of nonspam emails (134 of 2788) as spam.

Performance Benchmarking:

  • Establishing a naive baseline: the accuracy obtained by always predicting spam.
table(spam$type)
## 
## nonspam    spam 
##    2788    1813
benchmark = 1813/(2788+1813)
benchmark
## [1] 0.3940448

If the benchmark model predicts that every email is spam, it achieves an accuracy of only 39.4%. Our tree model achieves a much higher accuracy of 90.3%.
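The 60.6% "No Information Rate" reported by confusionMatrix earlier is the complementary baseline: always predicting the majority class, nonspam. A base-R sketch of both constant classifiers, using the class counts above:

```r
# Accuracies of the two constant classifiers, from the class counts
counts = c(nonspam = 2788, spam = 1813)
always_spam    = counts[["spam"]]    / sum(counts)  # the 39.4% benchmark used above
always_nonspam = counts[["nonspam"]] / sum(counts)  # 60.6%, the "No Information Rate"
round(c(always_spam = always_spam, always_nonspam = always_nonspam), 4)
```

A useful classifier should therefore clear the 60.6% majority-class baseline, not just the 39.4% all-spam one; the tree's 90.3% does so comfortably.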

Out-of-sample Performance:

The out-of-sample performance was assessed using 10-fold cross-validation.

train(type ~ ., data = s,
      method = "rpart", trControl = trainControl(method = "cv", number = 10))
## CART 
## 
## 4601 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4141, 4141, 4140, 4140, 4141, 4141, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.04302261  0.8632898  0.7103242
##   0.14892443  0.7906923  0.5528305
##   0.47655819  0.6710266  0.1965975
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04302261.

The CV procedure selects a complexity parameter of about 4.3% (cp = 0.043). This gives an out-of-sample accuracy of 86.3%, which is worse than the in-sample accuracy. It appears that the decision tree over-fits slightly, which drags down its out-of-sample performance.

tree2 = rpart(factor(type) ~ ., data = s, method = "class", control = rpart.control(cp = 0.043))
plot(as.party(tree2))

set.seed(2018)
type_pred2 = predict(tree2, type = "class")
confusionMatrix(
  data = type_pred2,
  reference = spam$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam    2592  327
##    spam        196 1486
##                                           
##                Accuracy : 0.8863          
##                  95% CI : (0.8768, 0.8954)
##     No Information Rate : 0.606           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7589          
##  Mcnemar's Test P-Value : 1.312e-08       
##                                           
##             Sensitivity : 0.9297          
##             Specificity : 0.8196          
##          Pos Pred Value : 0.8880          
##          Neg Pred Value : 0.8835          
##              Prevalence : 0.6060          
##          Detection Rate : 0.5634          
##    Detection Prevalence : 0.6344          
##       Balanced Accuracy : 0.8747          
##                                           
##        'Positive' Class : nonspam         
## 

Although this results in a lower in-sample accuracy (88.6%), we believe that this tree, whose complexity parameter of 4.3% was derived from the CV procedure, suffers least from over-fitting.

Random Forest

set.seed(2018)
rf = randomForest(factor(type) ~ ., data = s)
rf
## 
## Call:
##  randomForest(formula = factor(type) ~ ., data = s) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 4.65%
## Confusion matrix:
##         nonspam spam class.error
## nonspam    2709   79  0.02833572
## spam        135 1678  0.07446222

The random forest has an out-of-bag (OOB) error rate of 4.65%, which corresponds to an OOB accuracy of 95.4%, a little better than the decision tree's accuracy.
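The class-wise error rates in the output can be recovered from the OOB confusion matrix; a base-R sketch (counts copied from the output above, rows being the true classes):

```r
# OOB confusion matrix: rows = true class, cols = predicted class
oob = rbind(nonspam = c(2709, 79), spam = c(135, 1678))
colnames(oob) = c("nonspam", "spam")
class_error = 1 - diag(oob) / rowSums(oob)  # per-class error, as printed by randomForest
oob_error   = (79 + 135) / sum(oob)         # overall OOB error rate, ~4.65%
round(c(class_error, overall = oob_error), 4)
```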

K-Nearest Neighbour

X = t %>% select(-type)   # predictor matrix for knn()
fitCtrl = trainControl(
  method = "repeatedcv",  # 5-fold CV, repeated 10 times
  number = 5,
  repeats = 10)
set.seed(1)
knnFit1 = train(
  type ~ ., data = spam,
  method = "knn",
  trControl = fitCtrl)
knnFit1
knnFit1
## k-Nearest Neighbors 
## 
## 4601 samples
##   58 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 3680, 3680, 3682, 3680, 3682, 3681, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8057154  0.5913807
##   7  0.7993028  0.5773610
##   9  0.7946516  0.5672550
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

The caret package was used to choose the most appropriate value of k, using 5-fold cross-validation repeated 10 times. As the results above show, the most accurate value is k = 5, with a mean accuracy of about 80.6% across the repeats.

# Evaluate k = 5 on the full data set
knn1 = knn(train = X, test = X, cl = spam$type, k = 5)
cm_knn = confusionMatrix(knn1, spam$type)
cm_knn$table
cm_knn$overall[1] %>% round(2)

Testing the performance for k = 5: from the confusion matrix of the knn model, the in-sample accuracy of the k-nearest neighbours model with k = 5 is 87%.
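Before concluding, it helps to collect the headline accuracy figures reported in the sections above in one place (values copied from the outputs; note that in-sample and out-of-sample figures are not directly comparable):

```r
# Headline accuracies reported earlier in this report
acc = c(logistic_in_sample = 0.932, logistic_5fold_cv = 0.930,
        tree_in_sample     = 0.903, tree_10fold_cv    = 0.863,
        random_forest_oob  = 0.954,
        knn_in_sample      = 0.870, knn_5fold_cv      = 0.806)
sort(acc, decreasing = TRUE)
```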

Conclusion

The report shows that even though the random forest approach yields the highest accuracy, correctly identifying 95.4% of emails out-of-bag, its overall error rate is 4.65%. In contrast, the logistic regression, although it yields a slightly lower accuracy of 93.2%, misclassifies only 4.3% of nonspam emails as spam. To put this in the context of the data, a difference of 0.35 percentage points corresponds to around 16 emails out of 4601.

References

A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18–22.

Alexandros Karatzoglou, Alex Smola, Kurt Hornik, Achim Zeileis (2004). kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software 11(9), 1-20. URL http://www.jstatsoft.org/v11/i09/

Andreas Alfons (2012). cvTools: Cross-validation tools for regression models. R package version 0.3.2. https://CRAN.R-project.org/package=cvTools

Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/.

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.1. https://CRAN.R-project.org/package=stargazer

Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2018). caret: Classification and Regression Training. R package version 6.0-80. https://CRAN.R-project.org/package=caret

Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R. Springer, New York. ISBN 978-0-387-75968-5

Stephen Milborrow (2018). rpart.plot: Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. R package version 3.0.4. https://CRAN.R-project.org/package=rpart.plot

Terry Therneau and Beth Atkinson (2018). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-13. https://CRAN.R-project.org/package=rpart

Torsten Hothorn, Achim Zeileis (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905-3909. URL http://jmlr.org/papers/v16/hothorn15a.html

Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0